
    Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits

    We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime and time horizon. The algorithm is based on online mirror descent (OMD) with Tsallis entropy regularization with power $\alpha = 1/2$ and reduced-variance loss estimators. More generally, we define an adversarial regime with a self-bounding constraint, which includes the stochastic regime, the stochastically constrained adversarial regime (Wei and Luo), and the stochastic regime with adversarial corruptions (Lykouris et al.) as special cases, and show that the algorithm achieves a logarithmic regret guarantee in this regime and all of its special cases simultaneously with the adversarial regret guarantee. The algorithm also achieves adversarial and stochastic optimality in the utility-based dueling bandit setting. We provide an empirical evaluation of the algorithm demonstrating that it significantly outperforms UCB1 and EXP3 in stochastic environments. We also provide examples of adversarial environments where UCB1 and Thompson Sampling exhibit almost linear regret, whereas our algorithm suffers only logarithmic regret. To the best of our knowledge, this is the first example demonstrating the vulnerability of Thompson Sampling in adversarial environments. Last, but not least, we present a general stochastic analysis and a general adversarial analysis of OMD algorithms with Tsallis entropy regularization for $\alpha \in [0,1]$ and explain why $\alpha = 1/2$ works best.
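
    The abstract describes the update concretely enough to sketch: an FTRL/OMD step with the 1/2-Tsallis entropy regularizer over importance-weighted loss estimates. The code below is a minimal illustration under my own naming and constants; it uses the plain importance-weighted estimator rather than the paper's reduced-variance one, and solves the regularized step by binary search on the Lagrange multiplier.

```python
import numpy as np

def tsallis_inf_weights(cum_loss_est, eta, iters=100):
    """FTRL step for the 1/2-Tsallis entropy regularizer (illustrative sketch).

    Minimizes <x, L> - (4/eta) * sum_i sqrt(x_i) over the probability simplex.
    First-order conditions give x_i = 4 / (eta * (L_i + lam))**2, with the
    multiplier lam chosen by binary search so that the weights sum to one.
    """
    L = np.asarray(cum_loss_est, dtype=float)
    L = L - L.min()                                 # shift: the current best arm has L_i = 0
    lo, hi = 1e-12, 2.0 * np.sqrt(len(L)) / eta     # bracket: sum(x) > 1 at lo, <= 1 at hi
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if (4.0 / (eta * (L + lam)) ** 2).sum() > 1.0:
            lo = lam
        else:
            hi = lam
    x = 4.0 / (eta * (L + hi)) ** 2
    return x / x.sum()


def run_tsallis_inf(env, k, n, seed=0):
    """Play n rounds of a k-armed bandit with the sketched Tsallis-INF-style update.

    `env(t, arm)` (an assumed callable) must return the observed loss in [0, 1].
    This sketch uses the plain importance-weighted estimator; the paper uses
    reduced-variance estimators.
    """
    rng = np.random.default_rng(seed)
    cum_loss_est = np.zeros(k)
    total_loss = 0.0
    for t in range(1, n + 1):
        eta = 2.0 / np.sqrt(t)                      # anytime learning-rate schedule
        x = tsallis_inf_weights(cum_loss_est, eta)
        arm = rng.choice(k, p=x)
        loss = env(t, arm)
        total_loss += loss
        cum_loss_est[arm] += loss / x[arm]          # importance-weighted loss estimate
    return total_loss
```

    One design note: the resulting weights decay only polynomially in the cumulative loss gap, so seemingly suboptimal arms keep noticeably more probability than under exponential weights, which is the qualitative reason this regularizer behaves well in both regimes.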

    Factored Bandits

    We introduce the factored bandits model, a framework for learning with limited (bandit) feedback where actions can be decomposed into a Cartesian product of atomic actions. Factored bandits incorporate rank-1 bandits as a special case, but significantly relax the assumptions on the form of the reward function. We provide an anytime algorithm for stochastic factored bandits and matching (up to constants) upper and lower regret bounds for the problem. Furthermore, we show that with a slight modification the proposed algorithm can be applied to utility-based dueling bandits. We obtain an improvement in the additive terms of the regret bound compared to state-of-the-art algorithms (the additive terms are dominating up to time horizons which are exponential in the number of arms).
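
    As a small illustration of the action structure (the names and sizes below are made up): a factored action picks one atomic action per component, so the joint action space is the Cartesian product of the atomic sets while the number of atomic actions stays small.

```python
from itertools import product

# Hypothetical atomic action sets for three components (illustrative only).
atomic_sets = [["a0", "a1"], ["b0", "b1", "b2"], ["c0", "c1"]]

# A factored action is one atomic action per component, i.e. an element
# of the Cartesian product of the atomic sets.
joint_actions = list(product(*atomic_sets))

print(len(joint_actions))                 # 2 * 3 * 2 = 12 joint actions
print(sum(len(s) for s in atomic_sets))   # but only 2 + 3 + 2 = 7 atomic actions to learn about
```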

    An Optimal Algorithm for Adversarial Bandits with Arbitrary Delays

    We propose a new algorithm for adversarial multi-armed bandits with unrestricted delays. The algorithm is based on a novel hybrid regularizer applied in the Follow the Regularized Leader (FTRL) framework. It achieves an $\mathcal{O}(\sqrt{kn}+\sqrt{D\log(k)})$ regret guarantee, where $k$ is the number of arms, $n$ is the number of rounds, and $D$ is the total delay. The result matches the lower bound within constants and requires no prior knowledge of $n$ or $D$. Additionally, we propose a refined tuning of the algorithm, which achieves an $\mathcal{O}(\sqrt{kn}+\min_{S}|S|+\sqrt{D_{\bar S}\log(k)})$ regret guarantee, where $S$ is a set of rounds excluded from delay counting, $\bar S = [n]\setminus S$ are the counted rounds, and $D_{\bar S}$ is the total delay in the counted rounds. If the delays are highly unbalanced, the latter regret guarantee can be significantly tighter than the former. The result requires no advance knowledge of the delays and resolves an open problem of Thune et al. (2019). The new FTRL algorithm and its refined tuning are anytime and require no doubling, which resolves another open problem of Thune et al. (2019).
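
    As I read the setting, the FTRL step can only use loss estimates whose feedback has already arrived; a generic form of such an update (with the regularizer left abstract, since the paper's hybrid choice is its contribution) is

    $$ x_t \;=\; \arg\min_{x \in \Delta_{k-1}} \Big\langle x, \sum_{s:\; s + d_s < t} \hat\ell_s \Big\rangle + F_t(x), \qquad \hat\ell_{s,i} = \frac{\mathbf{1}\{I_s = i\}\,\ell_{s,i}}{x_{s,i}}, $$

    where $d_s$ is the delay of round $s$, so the estimate from round $s$ becomes available only at time $s + d_s$, and the total delay is $D = \sum_s d_s$.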

    Connections Between Mirror Descent, Thompson Sampling and the Information Ratio

    The information-theoretic analysis by Russo and Van Roy (2014), in combination with minimax duality, has proved a powerful tool for the analysis of online learning algorithms in full and partial information settings. In most applications there is a tantalising similarity to the classical analysis based on mirror descent. We make a formal connection, showing that the information-theoretic bounds in most applications can be derived from existing techniques for online convex optimisation. Besides this, for $k$-armed adversarial bandits we provide an efficient algorithm with regret that matches the best information-theoretic upper bound, and we improve the best known regret guarantees for online linear optimisation on $\ell_p$-balls and bandits with graph feedback.
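
    For context (my recollection of the definition, not taken from this abstract), the information ratio of Russo and Van Roy compares squared expected instantaneous regret to the information gained about the optimal action:

    $$ \Gamma_t \;=\; \frac{\big(\mathbb{E}_t\!\left[\ell_t(A_t) - \ell_t(A^\star)\right]\big)^2}{\mathbb{I}_t\!\big(A^\star;\,(A_t, Y_t)\big)}, $$

    where $Y_t$ is the feedback observed after playing $A_t$ and the expectation and mutual information are taken under the posterior at time $t$; a uniform bound $\Gamma_t \le \Gamma$ yields a Bayesian regret bound of order $\sqrt{\Gamma\, H(A^\star)\, n}$.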

    Bypassing the Simulator: Near-Optimal Adversarial Linear Contextual Bandits

    We consider the adversarial linear contextual bandit problem, where the loss vectors are selected fully adversarially and the per-round action set (i.e. the context) is drawn from a fixed distribution. Existing methods for this problem either require access to a simulator to generate free i.i.d. contexts, achieve a sub-optimal regret no better than $\widetilde{O}(T^{5/6})$, or are computationally inefficient. We greatly improve these results by achieving a regret of $\widetilde{O}(\sqrt{T})$ without a simulator, while maintaining computational efficiency when the action set in each round is small. In the special case of sleeping bandits with adversarial loss and stochastic arm availability, our result answers affirmatively the open question by Saha et al. [2020] on whether there exists a polynomial-time algorithm with $\mathrm{poly}(d)\sqrt{T}$ regret. Our approach naturally handles the case where the loss is linear up to an additive misspecification error, and our regret shows near-optimal dependence on the magnitude of the error.
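
    A sketch of the protocol as I read it (my paraphrase, not quoted from the paper): in round $t$ an adversary fixes a loss vector $\theta_t \in \mathbb{R}^d$, an action set $\mathcal{A}_t \subset \mathbb{R}^d$ is drawn from a fixed distribution $\mathcal{D}$, the learner plays $a_t \in \mathcal{A}_t$ and suffers loss $\langle \theta_t, a_t \rangle$, and performance is measured against the best fixed policy mapping action sets to actions:

    $$ \mathrm{Reg}(T) \;=\; \mathbb{E}\Big[\sum_{t=1}^{T} \langle \theta_t, a_t \rangle\Big] \;-\; \min_{\pi}\, \mathbb{E}\Big[\sum_{t=1}^{T} \langle \theta_t, \pi(\mathcal{A}_t) \rangle\Big]. $$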

    Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback

    We study online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback, without prior knowledge of the transitions or access to simulators. We introduce two algorithms that achieve improved regret performance compared to existing approaches. The first algorithm, although computationally inefficient, ensures a regret of $\widetilde{\mathcal{O}}(\sqrt{K})$, where $K$ is the number of episodes. This is the first result with the optimal $K$ dependence in the considered setting. The second algorithm, which is based on the policy optimization framework, guarantees a regret of $\widetilde{\mathcal{O}}(K^{3/4})$ and is computationally efficient. Both our results significantly improve over the state of the art: a computationally inefficient algorithm by Kong et al. [2023] with $\widetilde{\mathcal{O}}(K^{4/5}+\mathrm{poly}(\tfrac{1}{\lambda_{\min}}))$ regret, for some problem-dependent constant $\lambda_{\min}$ that can be arbitrarily close to zero, and a computationally efficient algorithm by Sherman et al. [2023b] with $\widetilde{\mathcal{O}}(K^{6/7})$ regret.
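
    For reference, the standard linear MDP assumption (stated from memory rather than from this abstract): there is a known feature map $\phi(s,a) \in \mathbb{R}^d$ such that transitions and per-episode losses are linear in it,

    $$ \mathbb{P}_h(s' \mid s, a) = \langle \phi(s,a), \mu_h(s') \rangle, \qquad \ell^{k}_{h}(s,a) = \langle \phi(s,a), \theta^{k}_{h} \rangle, $$

    where the measures $\mu_h$ are unknown, the loss parameters $\theta^{k}_{h}$ are chosen adversarially across episodes $k = 1, \dots, K$, and under bandit feedback the learner only observes the losses along its own trajectory.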

    Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously

    We develop the first general semi-bandit algorithm that simultaneously achieves $\mathcal{O}(\log T)$ regret for stochastic environments and $\mathcal{O}(\sqrt{T})$ regret for adversarial environments without knowledge of the regime or the number of rounds $T$. The leading problem-dependent constants of our bounds are not only optimal in some worst-case sense studied previously, but also optimal for two concrete instances of semi-bandit problems. Our algorithm and analysis extend the recent work of Zimmert & Seldin (2019) for the special case of multi-armed bandits, but importantly require a novel hybrid regularizer designed specifically for the semi-bandit setting. Experimental results on synthetic data show that our algorithm indeed performs well uniformly over different environments. We finally provide a preliminary extension of our results to the full bandit feedback setting.
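
    For context on the feedback model (a standard construction, not the paper's hybrid regularizer): in a combinatorial semi-bandit the learner plays a subset $A_t \subseteq [d]$ of base arms and observes the loss of every played component, so the usual importance-weighted estimator only divides by the marginal inclusion probability,

    $$ \hat\ell_{t,i} \;=\; \frac{\ell_{t,i}\,\mathbf{1}\{i \in A_t\}}{x_{t,i}}, \qquad x_{t,i} = \Pr[i \in A_t]. $$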

    Refined Regret for Adversarial MDPs with Linear Function Approximation

    We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes and the state space can be arbitrarily large. We assume that the Q-function of any policy is linear in some known features, that is, a linear function approximation exists. The best existing regret upper bound for this setting (Luo et al., 2021) is of order $\tilde{\mathcal O}(K^{2/3})$ (omitting all other dependencies), given access to a simulator. This paper provides two algorithms that improve the regret to $\tilde{\mathcal O}(\sqrt K)$ in the same setting. Our first algorithm makes use of a refined analysis of the Follow-the-Regularized-Leader (FTRL) algorithm with the log-barrier regularizer. This analysis allows the loss estimators to be arbitrarily negative and might be of independent interest. Our second algorithm develops a magnitude-reduced loss estimator, further removing the polynomial dependency on the number of actions in the first algorithm and leading to the optimal regret bound (up to logarithmic terms and dependency on the horizon). Moreover, we also extend the first algorithm to simulator-free linear MDPs, which achieves $\tilde{\mathcal O}(K^{8/9})$ regret and greatly improves over the best existing bound $\tilde{\mathcal O}(K^{14/15})$. This algorithm relies on a better alternative to the Matrix Geometric Resampling procedure by Neu & Olkhovskaya (2020), which could again be of independent interest.
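
    The log-barrier FTRL step the abstract refers to has the standard form (written from memory; the refined analysis itself is the paper's contribution):

    $$ \pi_t \;=\; \arg\min_{\pi \in \Delta(\mathcal{A})} \Big\langle \pi, \widehat{L}_{t-1} \Big\rangle + \frac{1}{\eta}\sum_{a \in \mathcal{A}} \ln\frac{1}{\pi(a)}, $$

    where $\widehat{L}_{t-1}$ is the cumulative loss estimate; the log-barrier keeps the iterates strictly inside the simplex and gives strong stability, which is the usual intuition for why it can accommodate large or negative loss estimates.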

    Model Selection in Contextual Stochastic Bandit Problems

    We study model selection in stochastic bandit problems. Our approach relies on a master algorithm that selects its actions among candidate base algorithms. While this problem has been studied for specific classes of stochastic base algorithms, our objective is to provide a method that can work with more general classes of stochastic base algorithms. We propose a master algorithm inspired by CORRAL (Agarwal et al., 2017) and introduce a novel and generic smoothing transformation for stochastic bandit algorithms that permits us to obtain $O(\sqrt{T})$ regret guarantees for a wide class of base algorithms when working along with our master. We exhibit a lower bound showing that even when one of the base algorithms has $O(\log T)$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret in model selection, even asymptotically. We apply our algorithm to choose among different values of $\epsilon$ for the $\epsilon$-greedy algorithm, and to choose between the $k$-armed UCB and linear UCB algorithms. Our empirical studies further confirm the effectiveness of our model-selection method.
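
    To make the master/base structure concrete, here is a minimal control-flow sketch under my own assumptions: each base exposes `act()` and `update(action, loss)`, `env(t, action)` returns a loss in [0, 1], and the master is a simple exponential-weights sampler with importance-weighted feedback. This is not the paper's CORRAL-based master or its smoothing transformation, only the generic shape of master-based model selection.

```python
import numpy as np

def run_master(bases, env, n, lr=0.1, seed=0):
    """Sample a base algorithm each round and feed it importance-weighted losses.

    `bases[j].act()` returns an action, `bases[j].update(action, loss)` consumes
    feedback, and `env(t, action)` returns the observed loss in [0, 1] (all assumed
    interfaces for this sketch).
    """
    rng = np.random.default_rng(seed)
    m = len(bases)
    cum = np.zeros(m)                           # importance-weighted cumulative loss per base
    for t in range(n):
        w = np.exp(-lr * (cum - cum.min()))     # exponential weights over bases
        p = w / w.sum()
        j = rng.choice(m, p=p)                  # pick which base acts this round
        action = bases[j].act()
        loss = env(t, action)
        bases[j].update(action, loss / p[j])    # importance-weighted feedback to the base
        cum[j] += loss / p[j]
    return cum
```

    Feeding the chosen base an importance-weighted loss is what forces base algorithms to be robust to reweighted feedback, which is the issue the paper's smoothing transformation is designed to address.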